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Abstract 

Every day, billions of mobile network events (commonly defined as Call De¬ 
tailed Records, or CDRs) are generated by cellular phone operator companies. 
Latent in this data are inspiring insights about human actions and behaviors, the 
discovery of which is important because context-aware applications and services 
hold the key to user-driven, intelligent services, which can enhance our everyday 
lives. This potential has motivated preliminary research activities in a variety of 
domains such as social and economic development, urban planning, and health 
prevention. The major challenge in this area is that interpreting such a big stream 
of data requires a deep understanding of mobile network events’ context through 
available background knowledge. Two of the most important factors in the events’ 
context are location and time. This article addresses the issues in context awareness 
given heterogeneous and uncertain data of mobile network events missing reliable 
information on the context of this activity. The contribution of this research is a 
model from a combination of logical and statistical reasoning standpoints for en¬ 
abling human activity inference in qualitative terms from open geographical data 
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that aimed at improving the quality of human behaviors recognition tasks from 
CDRs. We use open geographical data, Openstreetmap (OSM), as a proxy for 
predicting the content of human activity in the area. The user study performed 
in Trento city, Italy shows that predicted human activities (high level) from OSM 
match the survey data with around 93% overall accuracy, with fuzzy temporal 
constraints. The extensive analysis of the model validation for predicting a more 
specific—economical—type of human activity performed in Barcelona city, Spain, 
by employing credit card transaction data. This data gives us ground-truth infor¬ 
mation on what types of economic activity are occurring. The analysis identifies 
that appropriately normalized data on points of interest (POI) is a good proxy for 
predicting human economical activities, with 84% accuracy on average. So the 
model is proven to be efficient for predicting the context of human activity, when 
its total level could be efficiently observed from cell phone data records, missing 
contextual information however. 

Index terms — mobile phone data records, big geo-spatial data, human activ¬ 
ity recognition, human mobility, context aware computing, qualitative methods, 
human behavior, semantics, ontology, machine learning, knowledge management, 
linked open data 


1 Introduction 

Increasingly, massive amounts of data are being generated, stored, and disseminated as 
a result of human activity. For instance, whenever a mobile phone call, monetary trans¬ 
action, or social media post is made, geo-located data is automatically generated by mo¬ 
bile network provider, bank, or social network provider (e.g., Facebook or Twitter) and 
attached to the data record generated by the activity. An extensive body of works lever¬ 
age for studying human dynamics through cell phone data |[T] |27] [bO], social 

media posts Il28ll4n . bank card transactions llM1l62ll and vehicle GPS traces 
Over the last few years, the use of mobile phones as sensors of human behavior has rad¬ 
ically increased. Storage and analysis of information from mobile phones can provide 
useful insights into how people move and behave. From them, it is possible to infer 
with a certain level of accuracy, the activities humans are performing at every moment 
they are connected to the mobile network. Indeed, experiments in large-scale social dy¬ 
namics have been conducted in the areas of public safety and emergency management 
in 1^ 133, health and disease management mM, social and economic development 
ifTl [1811221 . transport/infrastructure |[5l [32]| . urban planning 13EH 1121ES[38]HD and 
international development, poverty ifS^ and more Q. A large fraction of mobile phone 
data has been shown to be extremely useful for humanitarian and development appli¬ 
cations (Robert Kirkpatrik UN 2013). This geo—and—time information have explicit 
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or implicit geographic location. 

A commonly used source of mobile phone data for which studies is comprised of 
aggregated and anonymized Call Detailed Records (CDRs), provide meta data about 
phone activity. This allows us to identify the time, the location, the duration, and possi¬ 
bly of the movements of mobile phones owners. Because the mobile network operator 
knows the locations of their cell towers, it is possible to use CDRs to approximate the 
location of users. The spacing of cell towers, and thus the accuracy in determining 
caller’s locations, varies according to expected traffic and terrain. Cell towers are typ¬ 
ically spaced 2-3 km apart in rural areas and 400-800 m apart in densely populated 
areas. Commonly the CDRs are available as anonymous, often incomplete or aggre¬ 
gated because of privacy issues. Accessing personal or private data is a very sensitive 
and critical topic, which arises claims to be aware of privacy issues before the research 
outcomes are shared. The main research area of this article is the one, which investi¬ 
gates inferring (collective) human behavior starting from such CDR. 

Current research E |8] 123 |45l notes that with a pure CDR, it is possible to identify 
human behaviors, but results suffer from the heterogeneity, uncertainty and complexity 
of raw datasets and that the lack of qualitative content is included in the data itself that 
may be used to help to infer human behaviors. In spite of the bad quality of data, some 
researchers are able to identify a certain level of human behaviors with the help of 
features that can be indirectly obtained from the raw CDR. These features are observed 
information about real-world cases. For example, a user’s home location is identified 
by the frequent stay (location) of the user between midnight to early morning IH. These 
features are mostly specified by domain experts. Most of the behavioral datasets, such 
as CDR or GPS traces, are relatively large raw datasets where user location is not 
perfect for tracking the dataset. Such data are incomplete, typically lacking in content 
and the measurement accuracy of the data is low and coarse grained. While this data 
provides extensive information regarding when and where people go, it typically does 
not allow understanding of which activity is performed at that location. 

When the context information of raw data is not directly available in the data, some 
useful characteristic values for human behaviors can still be discovered by analyzing 
the activity patterns of mobile network communication mni |26l IMl- For example, 
anomalous events can be observed by the change of daily communication activity pat¬ 
terns nni. Using some semantics about ongoing events (e.g., the location and the time 
of an event), researchers IS [8l [8] |23] |3T1 are able to discover some relationships be¬ 
tween human behaviors (i.e., communication and mobility patterns), and emergency 
(e.g., earthquake, blackout) or non emergency events (e.g., concert, festival). However, 
the results are limited to only specific events and the correlations between the events 
and human behaviors. For instance, the quantitative analysis of behavioral changes in 
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presence of extreme emergency events show radical increase of call frequency right 
after the event occurs, with long term impacts 0. 

Human behavior is influenced by many external factors, such as weather condition, 
urban structure, social event. Knowing the context, where a person is important to 
guess what he/she is doing. For instance, if a person is in or close to a restaurant and 
makes a phone call around 1:00 pm., or 8:00 pm., then the person is probably look¬ 
ing for eating some food, or (s)he works at the restaurant. In this article, we propose 
to formalize the context as a pair where I is a location (= geographical area) 

and f is a time interval. A context can be associated with various information. For 
instance, weather, events, surrounding geo-objects, and more. For every record in a 
CDR dataset, it is possible to determine to a certain level of approximation the con¬ 
text associated to such a record. Some research studies ll24l [30l l45l l46l l46l [65l have 
been performed to understand the correlations of human behaviors to environmental 
factors, using some additional contextual information about weather, social events and 
geographical information systems. This focuses on the inference of a certain level of 
human activities in a certain condition. For instance, human behavior changes during 
periods of uncomfortable weather condition 1^ . 

Among the literature of raw CDR analysis, researchers mostly employ data-mining 
and machine learning algorithm^ ifTTl l50l in the analysis. Supervised learning algo¬ 
rithms (e.g.. Multiclass SVM, Logistic Regression, Multilayer Perceptron, Decision 
Tree, KNN) can achieve reasonable accuracy on the CDR analysis but they require 
qualitative and quantitative properties in data and ground truth information in order to 
perform behavior classification well. The researchers claimed to acquire a sufficient 
amount of labeled training data. For their studies, but obtaining ground-truth infor¬ 
mation (e.g., user diaries and surveys) for training data remains a very expensive and 
almost unmanageable task, especially when one considers the large amounts of CDR 
data that need to be annotated and the diversity in the contexts of mobile phone events 
that makes the annotation more difficult. 

The result of CDR analysis usually provides quantitative evidence that are pre¬ 
sented in visual analytic tools (e.g. graphs and tendencies), and requires a qualitative 
description for the human behavior interpretation. Less interest has been dedicated to 
the development of methods that are capable of producing a qualitative/semantic level 
description of human behavior. This thesis will concentrate on providing qualitative 
descriptions of human behavior. For representing human behavior, we propose to use 
semantic descriptions based on an ontology. More in details, a semantic description of 
human behaviors is a representation of the behaviors of a single person, the behaviors 
of a group of people or the events that happen in the human society, in terms of con- 

'http://en.wikipedia.org/wiki/SupervisedJeaming 
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cepts and relations of an ontology. An example of a semantic description of human 
behaviors is in the fact that “a person is performing some specific actions” (e.g. work¬ 
ing, shopping and hiking), or the fact that “certain events are happening in a certain 
area” (e.g. a car accident and a train suddenly stopping in the middle of nowhere). 

Growing information in a variety of Web 2.0 and social platforms opens opportuni¬ 
ties for collecting and using information about the context associated to mobile phone 
data records. These sources include a wide and diverse set of data that can be useful to 
characterize a user’s context: environmental data; statistic description of the territory; 
public and private, emergency and non-emergency events; statistical data about de¬ 
mography, ethnography, energy or water consumption; and more. These kinds of data 
should be employed as contextual information if we are to gain a better understanding 
of a user’s context, as they allow a more representative and semantically expressive 
characterization of the relationships between human behavior and different contextual 
factors. 

This article seeks understanding human behavior based on contextual information 
obtained (also) by leveraging available Web 2.0 data sources, representing them in both 
a conceptual and computational model, such as the correlations between contexts and 
human behaviors, and prediction of human activities of a group of people will perform 
in a given context, and the most probable action that a group of people are doing in 
a context. The contextual information can be inaccurate, unavailable, uncertain and 
noisy. In our research, we need to pre-estimate if the information is useful enough 
by measuring the validity, accuracy, and suitability of the data. Such contextual data 
needs to be effectively integrated, coping with the problems caused by uncertainty and 
heterogeneity of data. 

We identified that points of interest (POIs) are good proxies for predicting the con¬ 
tent of human activities in the context in which mobile network events occur. Our pro¬ 
posed model is named High Level Representation of Behavior Model (HRBModel), 
and correlates POIs and the time of the day with typical user activities. The model in¬ 
tegrates an ontology for POIs and times/days with an ontology for human activities and 
statistical information about the correlations between the two. Given a set of POIs and 
a time of the day (which can both be inferred from the context), the model generates a 
set of human activities associated with a likelihood measure. 

We validated the accuracy of the model using two different qualitative geo-located 
data-sources, providing us ground-truth information on what type of human activities 
were performed in a given location: 1) user feedback collected in Trento, Italy and, 
2) bank card transaction data generated in Barcelona, Spain. Our extended evaluation 
of the model includes a validation of the impact of heterogeneity and uncertainty of 
open geographical data for recognizing human activities. This validation is performed 
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on city scale and takes different land-use types into account in order to understand the 
level of accuracy of human activity prediction from the geographical data or vice versa 
from bank card transaction data. 

The innovative aspect of the research is the development and usage of ontologies 
for analysis of raw CDR data, combined in a mixed model with data mining numerical 
methods. Our method improves the quality of human activity recognition tasks given 
noisy, lossy, and uncertain data, and allows to infer, with a certain level of accuracy, the 
different level (of hierarchy) of activities humans are performing at a specihc location 
and time. Accordingly, the model can deal with a wide range of contextual features 
about objects of a territory by extracting concepts and inter-concept relations. 

Another innovative aspect is that the HRBModel is a general model that can be used 
to provide semantic interpretation of any type of geo-located and time-dependent data, 
other than CDR, such as data provided by social platforms like Foursquare, or Twitter, 
GPS, and Credit Card data. Potential applications of this research are a context aware 
application systems, and “smart cities” applications that provide decision support for 
stakeholders in areas such as urban, transport planning, tourism and event analysis, 
emergency response, health improvement, community understanding, economic indi¬ 
cators and others EKHIia. 

The remainder of the paper is structured as follows: Section|^briefly summarizes 
relevant literature. In the following section, we introduce the data-sources we used 
for contextual information. The core methodology of the HRBModel to extract high 
level human activities from geographical information is explained in Section]^ The 
experimental results and evaluation of the model are presented in Section]^ In Section 
we discuss the usage of the model and describe possible developments (extensions). 
Finally, in Section]?] we draw some conclusions. 

2 Related Works 

Great opportunity and impact on human activity recognition is brought by mobile 
phone data. Through data mining and machine learning techniques, human activity 
data becomes available for further analysis. 

Researchers in the areas of behavioral and social sciences are interested in examin¬ 
ing CDR to characterize and understand real-life phenomena, such as individual traits, 
human mobility |]2l]31Ill]Sl> communication and interaction pattern ifSl [39l l49ll . Can- 
dia et al nni proposes an approach to understand dynamics of the individual calling 
activities, which could carry implications on social networks. The author analyzed call¬ 
ing activities of different group of users; (some people rarely use mobile phone, others 
use often). The cumulative distribution of consecutive calls made by each user is mea- 
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sured within each group and the result explains that the subsequent time of consecutive 
calls is useful to discover some characteristic values for the behaviors. For example, 
peaks occur near noon and late evening. The fraction of active traveling population 
and average distance of travel are almost stable during the day. This approach can be 
applied for detecting anomalous events. In |[8l, the authors analyze the mobility traces 
of groups of users with the objective of extracting standard mobility patterns for people 
during special events. In particular, this work presents an analysis of anonymized traces 
from the Boston metropolitan area during a number of selected events that happened 
in the city. They finally proved that people who live close to an event are preferentially 
interested in events organized in the proximity of their residence. 

Some researchers used semantic tags for geographic location from social networks 
or user diaries to identify a semantic meaning of places Il^l54ll57ll67ll^ captured in 
user CDRs or GPS trajectories. Supervised learning algorithms (e.g. binary SVM and 
hidden markov models) are employed for this identification. 

People perform different activities even when they stay in the same location at the 
same time, with respect to the wide difference of situations caused by various factors 
like natural, technological or societal disasters, from hurricanes to violent conflicts. 
This requires the consideration of potential influencing context factors on a common 
space-time basis that enables the associations of those different datasets. Indeed, some 
researchers EI EH HU use an additional information about context factors in or¬ 
der to study the relationship between human behaviors and context factors like social 
events, geographical location, weather condition, etc. This is always as successful as 
the quality of the context factors. 

In fact, some researchers validated the use of spatial and temporal contextual fea¬ 
tures for understanding the relationship between human behaviors and environmental 
factors. For instance, Sagl and Resch et al. El found that many factors influence 
the collective human behavior and weather conditions and geographic information are 
certainly two of these factors. Therefore, associating environmental and social fac¬ 
tors to mobile phone data is extremely useful for analyzing the dependency of human 
behaviors from the external factors. These studies focus on the inference of a cer¬ 
tain level of human activities under certain conditions like human behavior changes in 
uncomfortable weather conditions Il30ll4^l46ll65l . human mobility of different com¬ 
munities Goiiia, human mobility during an event EEiiia, or communication activity 
patterns in different land-use types ll24ll^l45llMI . In this work, geographical informa¬ 
tion in particular points of interest (POIs) are collected using pYsearch (Python APIs 
for Y! search services) from a map. They annotated POIs with four type of activities; 
eating, recreational, shopping and entertainment. The authors analyzed human activ¬ 
ity patterns (i.e., sequence of area profiles visited by users) correlating to geographical 
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profiles. Bayesian method is used to classify the areas into crisp distribution map of 
activities, which enables the activity pattern extraction of the users. The results shows 
that the users who share the same work prohle follow the similar daily activity patterns. 
But not only these area prohles can explain the mobility of these users and the enlarge¬ 
ment of activity or event taxonomy in those areas can enable the classihcaions of these 
activity patterns. Due to the limitation of heterogeneity of activities considered in this 
research, the results in measurement lacks of certainty. 

The authors of m hnd that residents are more attracted to events that are organized 
close to the home location of the residents. F. Girardin et al. Il24l analyzed aggregate 
mobile phone data records in New York City to explore the capacity to quantify the 
evolution of the attractiveness of urban space and the impact of a public event on the 
distribution of visitors and on the evolution of the attractiveness of the points of in¬ 
terest in proximity. In ll52]| . the authors introduced an approach to provide additional 
insights in some interactions between people and weather. Weather can be seen as a 
higher-level phenomenon, a conglomerate that comprises several meteorological vari¬ 
ables including air temperature, rainfall, air pressure, relative humidity, solar radiation, 
wind direction and speed, etc. The approach has been signihcantly extended to a more 
advanced context-aware analysis in 15^ . 

3 Data gathering 

In this study, we model one contextual location dataset of 4.6 million POIs in Trento, 
Italy, 2.6 million POIs from OSM in Barcelona, Spain against two 130K POSs from 
Banco Bilbao Vizcaya Argentaria (BBVA), and CDRs from GSM 900 and GSM 1800 
(Telecom Italia). Here, we describe the structure of mobile network events and to 
investigate how mobile network data can be combined with the available contextual 
information for supporting human recognition task. We show that location and time are 
the key factors to integrate mobile phone data records with the contextual information 
coming from the other sources about environments and events. 

3.1 Mobile phone data records 

There are many possible types of CDRs generated by operator companies. In the fol¬ 
lowings we describe the ones we are using to use in this article. The initial experiments 
of this thesis have been done on a CDR dataset (completely anonymized) provided by 
Telecom Italia with the following structure: 

source cell and target cell: the identihers of the cell from which the call is issued and 
the receiver’s cell, respectively. They are composed of a Location Area Code 
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and additional parameters, such as the operator, etc. Cell ID needs to be resolved 
into a physical geographical area 

date and time The date and the time of the day in which the call is issued 
duration The duration of the call 

To combine mobile phone data records with contextual information, location and 
time associated to each single call are key. CDR are not directly geo-referenced but 
they are labelled with the cell. From the cell, and the geographical cell distribution 
we can reconstruct with a certain level of accuracy and estimation of the geographical 
location of the user when he or she generated a particular mobile event. 

As an example, here we describe the structure of cell coverage map that gener¬ 
ated by Telecom Italia. We use aggregate GSM 900 and 1800 data that was provided 
by Telecom Italia. The cell coverage map of the mobile network is captured across 
Trento containing completely anonymized cell (antennas). Cell identihcation is used 
for identihcation of some portion of a physical geographical area featured within a set 
of devices (antennas) that support the mobile communication (e.g. calls, sms and in¬ 
ternet). The size of coverage areas can vary depending on the estimation of call traffic 
from/to this area. Usually the size of the coverage areas are inversely proportional to 
the density of the population inhabiting the area. However there is an upper-bound of 
the coverage area size due to antennas physical limits to 35 km. The cell partition is 
not hxed and can vary depending on the demand of network during specihc periods. 
For instance to support an increased request of communications, additional cells can be 
dynamically activated in order to increase the capability of the entire network. Partial 
knowledge of mapping between cell ID’s and coverage areas, irregularity of the size 
of the cell, and dynamically of the cell conhguration add complexity and uncertainty 
on the identihcation of the physical location from where a call is issued. Nowadays, 
cell coverage maps are restricted to be accessed due to privacy issues. Yet, some open 
sources for cell coverage maps are available on the web sites, such as Open Cell- ieQ 
and Open Signal Map^ 


3.2 Credit card data 

Credit card data is raw data semantically enriched with some semantic information 
which describes the type of transaction that will be used in our analysis in order to 
evaluate the proposed approach. The descriptions of which are provided for research 
purposes by one of the largest Spanish banks, BBVA. The raw dataset is protected by 

^http://www.opencellid.org 

^http://opensignal.com 
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the appropriate non-disclosure agreement and is not publicly available. The data was 
completely anonymized on the bank’s side, preventing access to any personal informa¬ 
tion of the customers in accordance with all applicable privacy protection regulations. 
Each transaction is characterized with its value, a timestamp, and additionally the retail 
location, where it was performed at the Points of Sale (POS) terminals. Each POS is 
categorized into one of the 76 business activity types, such as groceries, fashion, bars, 
restaurants, sport shops, etc. 

3.3 Contextual data 

Openstreetmap (OSM) is an open and free map of the world enriched with a large num¬ 
ber of geo-referenced objects upon an open content license. The database of OSM is 
frequently updated and is rapidly growing. It has a free tagging system based on a well 
documented taxonomy, which classifies objects in categories, such as roads, buildings, 
etc. OSM adopts a topological data structure based on the three core elements: 

Nodes: points with a geographic coordinate (i.e., a pair of latitude and longitude). A 
node mostly describes a POL Eor example, shops, restaurants, etc. 

Ways: ordered lists of nodes in a poly-line or a polygon. These elements are used for 
representing linear features or areas, such as streets and forests. 

Relations: groups of nodes and ways that represent the relations among existing nodes 
and ways, such as roads with several exiting ways. 

Each of aforementioned three element types is described with geographic coordi¬ 
nates. Nodes are mainly used to represent POIs, but they can also be used as part of 
ways to represent linear features. So a POI can be described in a node or in nodes, 
and part of a way, for instance, a railway level crossing. The metadata of OSM is 
provided in the form of tags (i.e., a textual description described as a pair of key and 
value) to describe each element. Nodes, ways, and relations are tagged, and that the 
tag represent some information about the node, for instance if it is a station, a shop, a 
road, a crossing, etc. A list of common tags used (but not all) is captured in the Map 
Eeatures pag^ The tags for nodes, ways, and relations are represented as a set of 
pairs of {key, value). Eor example, a university is tagged as that the key is amenity, 
the value is university, and the highway route is tagged as that the key is route, the 
value is highway. The OSM tags for one node corresponding to a supermarket in XML 
representation are shown below:: 

^http://wiki.openstreetmap.org/wiki/Map .Features 
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<node id="618033185" version="3" uid="330007 


user="pik:appa7 9 " 

tiinestamp="2 011-05-2 6T1 6 : 02 :14Z" 

changeset="8254868" 

lat="46.0946011" lon="ll.1162507"> 

<tag k;="addr :postcode" v="38121"/> 

<tag k="shop" v="supermarket"/> 

<tag k="addr:country" v="lT"/> 

<tag k="name" v="Supermercato PAM"/> 

<tag k="addr:housename" v="Bren Center"/> 

<tag k="addr:street" 

v="Via Giovanni Battista Trener"/> 

<tag k="addr:city" v="Trento"/> 

</node> 

The way we intend to use OSM in this article is by collecting the objects which are 
in the cell area. These objects can provide insights on the type of action a person while 
he is located in a cell area. For instance, if a person is in or close to a restaurant and 
makes a phone call around 1:00 pm., or 8:00 pm., then the person is probably looking 
for eating some food. Alternatively or complementary, we could use Google Map or 
Yahoo Map or regional cadastral map database for collecting the POIs within the each 
cell area. A relevant technique for POIs extraction in a given region from geographical 
information system is described in ifTsl |45]| . 

Table [T] introduces a brief description of ontology we used to generate useful in¬ 
formation from the aforementioned heterogeneous data sources. For the integration 
between these qualitative data-sources we designed an ontology for each domains used 
in this PhD research work: 1) POI, 2) Action, 3) Time. This ontology is based on 
surveys, crowdsourcing, and domain experts, or the formally defined classifications for 
instance, business career classification by YellowPages or sport classification by the 
Olympic Game Committee, etc. 


4 High-level representation of behavior model 


This section introduces a prediction model, called High Level Representation of Be¬ 
havior Model (HRBModel) fla. This model generates a set of human activities with 
a likelihood from the set of POIs. The structure of the model is visualized in Figure 


1(a) The HRBModel consists of two core components: the Human Behavior Ontology 


(HBOnto) and the Stochastic Behavior Model (SBM). The HBOnto is an ontology that 
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Table 1: Input and Output cardinalities 


Ontology 

Input 

Input 

Cardinality 

Output 

Output 

Cardinality 

OSMonto 

OSM 

1822 

POIs 

517 

ActOnto 

Survey, Crowdsourcing 
YellowPages 

420 

actions 

217 

TimeOnto 

Survey, experts 

72 

fuzzy times 
days 

27 

HBOnto 

OSMonto, ActOnto 
TimeOnto 

52 classifiers 

actions 

parent-10 
all-217 


provides a classification of POIs, human actions, and periods of the day, and it formal¬ 
izes the relations between these three sets of elements. For example, at the university, 
people are studying in the morning and afternoon. The SBM provides a likelihood 
measure to each of the actions that has been selected by the ontology, on the basis of 
the frequency of POIs and on a fuzzy model of the actions depending on the time of 
the day. It allows us to predict the possible top-fc activities associated with a likeli¬ 
hood, that could be performed in each context of a given region. For example, having 
a breakfast activity is the most probable activity at residential types of area in the early 
morning. 

In the remainder of the chapter, we describe in details the two components of the 
HRBModel; HBOnto in Section [4~T] and SBO in Section |431 In Section [4^ we show 
the application of the HRBModel. 


4.1 Human behavior ontology (HBOnto) 


The Human Behavior Ontology (HBOnto) provides a set of possible activities for a 
given POI type or in the time of day, i.e., it does a priori activity selection on the 
basis of POIs and time of the day. For instance, it excludes “sleeping” if the time of 
the day is 1 ;00 pm. and it excludes “hiking” if you are in the city center. It includes 
activities only if there are POIs that activates them. For example, an area containing 
a swimming pool activities swimming activity. The HBOntcj^ is composed of three 
ontologies (see Figure l(b)| i. OSMOnto, which contains the various classes of POIs 
as classified in OSM; ActOnto, which contains the different types of human activities 
we are interested in; TimeOnto, which contains temporal references in which those 
human activities can occur. These three ontologies are related by two main relations. 
The first one connects an action class to the classes of POIs, in which the action can be 
performed. The second relation connects an action class to the classes of the time of 


^http://brenta.disi.unitn.it/ dashdorj/Ontology/HBOnto.zip 
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day during which the action is possible. In the following, we describe each ontology 


of HBOnto (Sections 4.1.1 4.1.2 4.1.3| l, the relations among them (Sections 4.1.4 and 
|4.1.5| l and how DL reasoning on HBOnto can be used to determine possible activities 
at a given POI type and time (Section|4~2]i. 



(a) The prototype framework of the HRBModel 


(b) Composition of human behav¬ 
ioral ontology (HBOnto) 


Figure 1 


4.1.1 The Ontology of POl—OSMonto 

A POI denotes a geo-referenced object, such as a restaurant, a shop, or a lake or other 
objects, whih have a precise geographical location. In this study, we extended the ex¬ 
isting OSMonto ontology 02, which proposes a classification of POIs, by adding new 
POI types that have recently been appended to OSM (see, for instance as well as 
the new frequent POI types found in our analysis. The (extended) OSMonto ontology 
is a taxonomy of POI types, that formalizes the information about POIs contained in 
OSM. The POIs are encoded in the OSM as in the form of a pair {key, value). This 
information is transformed in order to be represented in OSMOnto. The key is a type 
of POI that includes a prefix and the value is a sub type of POI that includes a 
prehx “u_”. However, according to the OSM tagging structure about POIs, some POIs 
might have the same “u_” value denoting different sub-types based on the “fc_” key. 
For instance, a POI {railway, train), and a POI {route, train) both have the same 
values. In this case, we represent those types of POIs as the combination of the key and 
the value, for instance in kjrailway-V-train. An example of the classes of POIs in 
OSMOnto is shown in Figure]^ We added manually the recent POI types appeared in 
OSM. In total, 517 POI types are used in OSMonto. We discard and exclude the irrel¬ 
evant POIs, for example, those do not reflect a human action like benches, chimneys, 
power towers, and trash bins. 
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▼ •POI 

► •k_aerialway 

► •k_aeroway 
▼ •k_amenity 

• k_amenity_v_swimming_pool 
•v_airport 
#v_arts_centre 
#v_atm 
#v_bank 
•v_bar 
#v_bench 
#v_bicycle_parking 
#v_bicycle_rental 
•v_biergarten 
•v_bureau_de_change 
•v_bus_station 
#v_cafe 
#v_car_rental 
♦v_car_sharing 
• #v_car_wash 
#v_casino 
•v_charglng_station 


Figure 2: OSMOnto ontology for POI classes 


4.1.2 The Ontology of Human Activity—ActOnto 

The ActOnto ontology is a taxonomy of human activities. ActOnto contains totally 
217 human activity classes hierarchically organized in different levels up to four lev¬ 
els. The upper level is composed of 10 activity types (see the Table [^. Various 
sources are exploited to classify those activities, such as Yellow Page^ Human Re¬ 
source Management] and Olympic Garnet] and we also asked for support from do¬ 
main experts. The ontology has been constructed manually. Figure shows the top 
two levels of the human activities taxonomy. We use description logic to formalize 
the hierarchy of human activities with axioms. For example, traveLby.airplane C 
travel.by.transport, travel.by.car Q travel.by.transport 


4.1.3 The Ontology of Time—TimeOnto 

Human activities are also correlated to time and, in particular, to the time of day and 
the day of week. For example, in early morning, people have breakfast and in the af¬ 
ternoon people have lunch. For the task of activity detection, we are more interested in 
a qualitative representation of time than fine-grained time measurements. The Time¬ 
Onto ontology is designed for modeling time in qualitative terms (e.g., late morning), 
organizing them into a containment hierarchy (e.g., morning includes late morning). 
The qualitative terms fuzzify numerical time periods. A representation of how fuzzy 

®http://www.yellowpages.ca/business/ 

’http://mayor2.dia.fi.upm.es/oeg-upm/files/hrmontology/hrmontology-RDF.zip 
*http://swat.cse.lehigh.edu/resources/onto/olympics.owl 
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Table 2; Human activity categories in ActOnto 


Top level classes of activities 

Type of POIs 

eating 

fast food, food court, restaurant, cafe 

shopping 

grocery, general stores 

health medicine activity 

hospital, pharmacy 

entertainment activity 

bar, casino, movie, theater 

education activity 

library, university school 

transportation traveling 

airplane, bus, car, train 

outdoor activity 

sightseeing, personal care, religious places 

sporting activity 

car racing, summer, winter sports 

working activity 

professional work place, industrial place 

residential activity 

guest house, hotel, hostel, residential building 


times/days are hierarchically organized is given by following axioms in description 
Logic: 


morning C time 

earlyjmorning Q morning 

midjmorning Q morning 


weekday C day 
holiday Q day 
Saturday Q holiday 
Sunday Q holiday 


In total, 17 classes of fuzzy time and 10 classes of fuzzy day are described in the 
ontology. 

4.1.4 Relation between POIs and Human activities 

A POI activates a set of human activities that people can perform when they are in 
that object, or in its neighborhood. For instance, in a restaurant people usually perform 
eating activity. On a highway road, traveling by transportation is the most usual activity. 
Every poii is related with a set of activities A{poii) C {oi, 02 ,.., an} which can be 
performed or hosted there or nearby. For example, on a highway road, traveling by 
car is the most usual activity. The relationship between human activities and POIs 
is represented with the object property “what.candbejlone”. The assertion of this 
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Figure 3: Hierarchy of human activity classes are visualized in two levels down from 
the top level of human activity classes in ActOnto in which the nodes are the top level 
of activities and the leaves are the second level of activities 


relationship 1291 is an existential restriction that is exemplified by the following axiom, 
which states that for every highway road there is an activity of the type traveling by car, 
which is possible: v-highway C 3what-canjbe-done.travelingJby-car. As a result, 
we defined around 151 associations between human activities and corresponding POI 
types. 

4.1.5 Relation between Human activities and Time periods 

The activities of a user are highly influenced by his location. For instance, if a person is 
close to a university, the most probable activities are studying, teaching, and working. 
In order to capture this dependency, we first need to model which activities can be 
performed or hosted within or nearby every POI (e.g., eating is possible in a restaurant. 


16 















while traveling is possible on a railway). Different activities have different timetables; 
these timetables may depend on a particular context/culture, e.g., the city of Trento. 
Then we performed the experimental evaluation. 

• shopping activity [9 ;00 am, 12:00 pm.], [3:00 pm.,6:00 pm.], except Sunday. 

• eating activity [8:00 am.,2:00 pm.], [6:00 pm.,10:00 pm.]. 

• entertaiment activity [8:00 pm.,3:00 am.]. 

Every human activity type is related with a set of times/days A{ti) C {oi, 02 ,.., a„}. 
This relationship between time of the day and human activities is represented with an 
object property, ‘Hs.usually.done.during” and the relationship between days and hu¬ 
man activities is designed as an object property ‘Hs.usually.done.on". The assertion 
of these relationships is an existential restriction, which is specified using the following 
example axiom in DL: shopping C activityr\3is.usually.done.during.{morningr\ 
afternoon) n 3is.usually.done.on.{weekday n Saturday). This axiom should be 
read as “every shopping activity can be possibly performed in the morning and in the 
afternoon except on Sunday”. We attempt to associate time periods to all actions using 
the above type of axiom accordingly. As a result, all human activities in HBOnto are 
associated with the day of the week and the dime during the day. 

4.2 Usage of the HBOnto ontology 

We explain the formulas of the ontology for deriving of a set of human activities that 
could be performed in a certain POI during a certain time of the day and day of the 
week. We use an axiom for deriving the activities. We use a DL reasoner, for instance, 
FACTh-h- reasoner in Pellet 

4.2.1 Deriving Human Activities in a Given POI type 

For deriving a set of human activities from a certain POI, there is a boolean valid¬ 
ity function VAL(poZi, ai) that indicates whether an activity is valid or is not in poii 
which exploits the reasoning of the axioms defined, as the following example axiom: 

V.highway C 3what.can.be.done.traveling.by.car. 

4.2.2 Deriving Human Activities at a certain time and day 

For extracting a set of human activities in a certain time of the day and day of the 
week, there is a validity boolean function 'VAL{ai,ti,di) that indicates whether an 
activity is valid or is not at ti in di. The function is derived from the reasoning axioms 

®http://clarkp arsia.com/pellet/ 
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Ass«rt«d class hierarchy ' Inferred class hierarchy | 

► ©sunday_mornjng_activlty ± 

► ©sunday_night_activity 

f ©workday_afternoon_actlvity 

▼ u workday_early_afternoon_actjvity 
© doing a house^activity 

► #education_actlvlty 

© having_sightseeing_trip 

► ©outdoor_other_actlvity 

► ©personal_care — 

^ ©relaKation_activity 

► ©religlous_activlty 

► ©ser\/lces_and_buslness 

► ©shopping 

► ©sporting 

► ©travelling_by_transport 
©visiting_a_clinic_hospita) 
©visiting_a_gallery 
©visiting_a_museum 
©visiting_a_pharmacy 
©visiting_a_special_care 

► ©working 

T ©workdayJate_afternoon_activity 
©doing_a_house_actlvity 

► ©education_activity 

© having_sightseeing_trip 

► ©outdoor_other_activity 

► ©personal_care 

► ©relaxation_activity 

► ©reiigious_activity 

► ©services and business T 


^ Class Annotations Y Class Usage ] 


I Annotation#, workday earK,- afternoon activity (S5ll)[il| 



|Pe#crtp£on^<or1cda^earii^ftemoor^ctM^^^^^^^^^^^^^MBBm 


Equivalent classes 

• activity 

and is_usually_done_on some workday 
and is_usually_done_during some 
early_afternoon 

Superclasses 

O workday_afternoon_activity 


I nferrcd anonymous superclasses 

© activity 

and is_usuaiiy_done_on some workday 
and is_usuaiiy_done_during some afternoon 

Members 

Disjoint classes 


Figure 4: Example of a classifier atomic formula for human activities on weekday early 
afternoons 


called derived facts. For example, in order to derive what are the activities that are 
usually done during the early afternoon of workdays as the following example axiom: 

workday -early _afternoon-activity = activity □ 

3is-Usually-done-during.early-afternoon □ 3is -usually -done-on.workday 
We modeled 52 axioms for this reasoning. For example, a reasoning on human 
activities in the early afternoon of weekday is shown Figure]^ 


4.3 Stochastic behavior model (SBM) 


The HBOnto ontology provides a set of possible actions, a stochastic model (SBM) - 
the other component of the HRBModel we propose - tries to rank the actions on the ba¬ 
sis of the number of POIs of certain type and the distance from them, and a fuzzy model 
from the time of the day. Thus we compute as following: P{a\l, t, d) = P{a\t)*PJ^a\t,d) ^ 


where conditional independence of I and time periods (f, d) in given a. Section 4.3.1 
explains the estimation of the P{a\t, d) based on the relative importance of action in 


certain time period t of d employing Fuzzy model. Section 4.3.2 describes the esti¬ 
mation of the P{a\l) based on the relative importance of POIs in location I employing 
TF-IDF. Section |4.3.3 shows how this approach can be generalized to P{a\l, r) in or¬ 
der to account for POIs that are in a larger neighborhood of the user to make the model 
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more robust. P{a) is the total probability of actions occur in all locations and all certain 
time periods of all days. 


4.3.1 Likelihood of human activity in a given time/day 


The probability of human activity that happens at a certain time and day is P{a\t, d) 
that is a probability computed using the Fuzzy Reasoning model, as described in the 
following formula: P{a\t^d) = ’ where S{x) is a fuzzy controller which 

estimates the area of the fuzzy values (fuzzy sets are represented in trapezoidal curve), 
FM (a) is a trapezoidal fuzzy membership function for possible values of action set a; 
a A, A={eating, working, studying,...} and FM{t, d) is a trapezoidal fuzzy member¬ 
ship function for possible values of timeset t,tGT, T={early morning, mid morning. 


late morning, mid day,..,late night}. For example. Figure 5(a) illustrates that the fuzzy 
sets for the morning time (interval [6:00 am., 11:00 am.], for the having breakfast activ¬ 
ity (interval [6:00 am., 10:00 am.]), for the afternoon time and so on. In this example, 
the probability of having breakfast in the morning is the intersection between the fuzzy 
set of morning time period and the fuzzy set of having breakfast. This intersection im¬ 
plies the probability of the action occurring during a certain time period and it is com- 

22 

puted as the trapezoid area estimation in the fuzzy controller: S{x) = ^ 

i=0 

where h is the height of the trapezoid and equal to 1, are the fuzzy set values of 
action a; a G A, A = {eating, working, studying,...}, ti are the fuzzy set values of time 
period t,tGT, T= {early morning, mid morning, late morning, mid day,..,late night}, 
i is a parameter for time interval by hour (i=0,l,2...,22). 



(a) Fuzzy intervals for morning time (blue), 
having breakfast activity (red), midday (yel¬ 
low), afternoon (purple), evening (green) 



(b) Minimum cartesian distance between the 
centroid point of the circle with the aggregation 
radius r and the nearby areas 


Figure 5 
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4.3.2 Likelihood of Human Activity from a set of POIs in a Given Location 

POIs are the most important factors affecting the likelihood of activity occurrence. 
But not all the POIs has the same weight. Some POIs, such as bus stops, small 
shops, ATM, etc., which frequently occur in several areas, have a minor discrimina¬ 
tive influence on the likelihood estimation than very distinctive POIs, like an airport 
or a swimming pool. Borrowing the approach commonly used in the Information Re¬ 
trieval, we estimate the importance of POIs in a given area by the Term Frequency- 
Inverse Document Frequency (TF-IDF) 1551 function. In this case, the Term Fre¬ 
quency (TF) factor measures the frequency of a POI in a given area, while the Inverse 
Document Frequency (IDF) factor gives an indication of the general discriminative 
power of a POI. A high IDF factor is associated with POIs that are rare while a low 
IDF factor is associated with POIs that are very common and thus are of low use¬ 
fulness for distinguishing among different areas. For these common POIs, the ratio 
inside the logarithm of the IDF formula approaches 1, bringing the TF-IDF closer to 
0. More in details, the weight of the POIs in location I is estimated by the formula, 
tf-idf(/,() = * log ipgi^lgqp where / is a given POI; / € F, 

F={building, hospital, supermarket,...} and I is a given location; I G L, L={locationl, 
Iocation2, locations,...}, N{f, 1) is the number of occurrence that POI/ appears in lo¬ 
cation / and argmax{A^(r(;, () : w G 1} is the maximum occurrence of all the POIs 

W 

in location I, \L\ is the number of all locations, \{l G L : f G (}| is the number of 
locations, where POI / appears. This way, after we assign a weight to each POI in 
a given area, we retrieve a set of human activities relevant to those POIs in the area 
by performing reasoning on the HBOnto ontology. A weight of POIs is propagated 
to the relevant actions, and the total weight of a given action is estimated as follow: 

, where a is the action in location I, f is the element of 
POI types which derive action a, |A(/)| is the number of elements (actions) in the set 
A{f), and F{a) is the set of POI types associated to the action a. The probability of 
human activity that happen in location I is a probability given by P{a\l) = , 

where W{a, 1) is the weight of human activity a that occurs in location t, I G L, L = 
{location!, Iocation2, locations,...}, and W{1) is the total weight of all activities that 
occur in location /. 

4.3.3 Likelihood of human activity from a set of POIs in or nearby locations 

In some cases, the correct activity for a user cannot be predicted properly based on the 
POIs immediately close to the user. This happens, for instance, if the user is moving 
and starts a phone call before reaching his intended destination where the activity will 
take place. To tackle these cases, we propose an extension of the stochastic model that 
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also takes into account POIs that are not located in the user location but can be find 
considering a larger neighborhood. 

Starting with the centroid of a location, we consider POIs in locations within 100 
m, as a default aggregation radius of the point. If the total number of POPs in those 
locations is at lower than h=5Q (as a default), we expand the aggregation radius by 25 
m until the number of POIs in the intersecting locations is higher than h, or until the 
aggregation radius reaches to maximum radius of 3.000 m which was chosen consis¬ 
tently aligned with the maximum coverage of, for instance, mobile phone cells. As 


an example. Figure 5(b) shows that four nearby locations fall inside the aggregation 
radius r of the centroid point of a given location. The weight of the activities in such 
areas is estimated depending on the intersecting weight [0,1] of the nearby locations 
fall inside the circle: k G L{l,r). So we extend P{a\l) using aggregation radius r, 
which is the likelihood estimation of the human activities in given location I as the for- 

J2 {W{a,li)*\i} 

mula: P{a\l,r) = {tv(a,,z.)*A.} , where W (a, 1) is the weight of action 

liGUl.r) aje/i(ii) 


a in location I, and A{li) is the set of actions in location li. The Xi is the intersection 
weight [0,1] that is measured by the minimum cartesian distance between the centroid 
point of the circle with radius r given by the centroid point of location I and the closest 
point of nearby areas Ai = 1 — where di is the distance of nearby location 

i, in which the distance is equal or lower than the r of the circle. For instance, if a 
nearby location falls in the circle, the intersection weight is 1, if not, it is lower than 
1. Given Xi, the {W (a, li) * A^} is the activity weight in the intersecting locations that 
fall inside the circle. The following example introduces the human activities with a 
likelihood measure, which could be performed in the morning of workday for a given 
set of POIs in a location as shown in Table |3] 


POIs 

Count 

Weight 

Activities in a given location 

Activities in the morning of workday 

Probability 

{tourism, hostel) 


0.3 

relaxing at home 
having breakfast 
having lunch 
having dinner 

relaxing at home 
having breakfast 

0.1 

0.1 

{natural, tree) 

10 

1.5 

hiking 

hiking 

0.6 

{highway, busstop) 


0.02 

traveling by bus 

traveling by bus 

0.01 


Table 3: The HRBModel inference for human actions in the morning of workday 


4.4 Usage of the HRBModel 

The HBRModel estimates the probability of an action to happen in a region on the bases 
of the frequency of POIs. POIs are bias across areas. Because some locations (i.e., the 
center of the city) have many POIs that could derive many actions and locations (i.e., 
the remote area) have few POIs that could lead to few actions. So, we cannot divide 
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a region in areas of the same dimension, but we have to take into account the number 
of POIs included in an area. We consider the areas to be well characterized in terms 
of POIs. We divide a territory into grid cells(=locations) in order to have uniformly 
distributed POIs across locations. We consider Openstreetmap (OSM) as the source of 
POIs, as it is open and free. We hrst describe how POI information (human activity 
relevant) can be extracted from OSM for a given location. Then we described how a 
territory can be organized into locations satisfying certain properties on the basis of 
available location and POI data. An default and spatial rectangular grid is designed for 
populating the representative POIs (i.e., related human activities) at each location. The 
area of a region or territory is divided into unit areas, where each unit size of the grid 
is 50 and in each of that we populate the POIs. 

We build a density-based POI distribution grid for a territory which have a thresh¬ 
old number of POIs in each cell(=location) of the grid (in a default conhguration, the 
aggregation radius is 0). The default grid is then re-partitioned using Quad-Trees ll20l 
in a way that each unit area (=location) contains h G [10,20] number of POPs. The 
Quad-trees are commonly used to partition two dimensional space by recursively sub¬ 
dividing it into four quadrants or regions. The regions might be square or rectangular, 
or might have arbitrary shapes. We define this re-partitioned grid as a density-based 
POI distribution grid, in which the locations in urban spaces have a smaller size of 
coverage as the POI density is higher and the locations in remote spaces have a larger 
size of coverage as the POI density is lower. In our analysis, we discard and exclude 
the irrelevant POIs, for example, those do not reflect any human activity like benches, 
chimneys, power towers, and trash bins. 

5 Evaluation 

In this section, we performed three types of evaluation for the applicability of the pre¬ 
dictive model, HRBModel, taking various types of spatial and temporal data and their 
factors into account; 

• Evaluation through user feedback 

• Evaluation through bank card transaction data revealing actual human economic 
activities 

• Extended evaluation of the model 
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5.1 Evaluation through user feedback 


We designed an experimental application for collecting user feedback data to validate 
our HRBModel. The user feedback was collected in the city of Trento, Italy. We then 
present the experimental outcome and evaluation results. The goal of the experiment 
was to understand whether human activities associated with a given area at a given 
time in our fuzzy model are correct, activities are evaluated with respect to the feed¬ 
back provided by a set of users. The application recommends a list of top-k possible 
human activities at each user-chosen context and then users identified evaluate the cor¬ 
rect activity he/she performed from the list. The application shows the map of Trento 
overlaid by the density-based POI distribution grid of the mobile network coverage 
map. The interface of the application is illustrated in Figure The figure shows the 
most probable activity in top-k is predicted to the user. The preliminary analysis of this 
evaluation was introduced in our previous work UB- in this work, we improved the 
fuzzy reasoning model and re-analyzed the geographical area with richer content. 



Figure 6: Visualization of demonstration app, at Piazza Duomo on weekday mornings 

In building this application, hrst of all, we needed to build a density-based POI 
distribution grid for this city; for this we collected 4.6 million POIs extracted from 
the OSM. Using Quantium Gisf*^ the map of the city has been divided into a grid 
of dimensions 401 x 302 cells in the beginning, where each unit size of the grid is 
50 m^. The cells are partitioned using quad-tree in a way that each unit area (i.e., 
location) contains h G [10,20] size of representative POIs. The re-partitioned locations 
containing threshold h number of POIs are represented in Figure [7(b)1 In total, 3,150 
locations were processed. Using the data from OSM, we populated the POIs at each 

*®www2.qgis.org 
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location with those contained in, crossed over located on the border of, or intersected 
with that location. In total 135,918 relevant POIs were extracted across the locations; 
this number was reduced to 31,514 after cleaning and discarding irrelevant POIs (i.e., 
those that do not reflect relevant human activity). 



(a) POI distribution in Trento 


(b) Density-based POI distribution grid 
in Trento 


Figure 7 


Figure 8(a) shows the comparison of number changes of the items (i.e. raw and 
rehned POI and extracted human activities) across locations in scatter plot. The rehned 
number of POIs are 10 times lower than the number of raw POIs after the cleaning and 
discarding irrelevant POIs. The number of human activities extracted from the rehned 
POIs are 5 times lower than the raw POIs after the activity inference. The density 
distributions of number of POIs and human activities across locations are represented 


in Figure 8(b) where the density curves are Gaussian types. This hgure shows that 
across locations, the minimum, median, and maximum number of POI are 15, 38, and 
375, and the minimum, median, and maximum number for human activity are 1, 9, 
and 59. The remote areas are described in the left tail of the distribution that contain a 
small number of POIs and human activities, up to the mode of the density distribution 
of POIs 37, and human activities 9. The dense areas are described in the right tail of 
the distribution that contain the number of POIs and human activities are up to 375 and 
59, respectively. 

To populate a sufficient number and diversity of POIs in each location, we consider 
the nearby areas for estimating the likelihood of human activities. The nearby areas are 
the intersected locations within the aggregation radius of the centroid point at each lo¬ 
cation. The aggregation radius is configured differently for each location; this satishes 
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(a) The changes of the number of refined POIs 
(green) and extracted human activities (reddish) 
compared to the raw POIs, aggregation radius=0 



The number of human activities, 

POI's in each location 

(b) The distributions of POIs and human activities 
across locations, aggregation radius=0 (i.e. default 
configuration) 


Figure 8; The representative POIs and human activity distribution 


the need for the total number of POIs in such intersected locations to be above thresh¬ 
old h (see Figure |^, where each location has at least h(=5Q) number of POIs in the 
intersected locations. Across locations, the minimum, median, and maximum number 
of POIs are 14, 55, and 268. The distinct activities and parent level of distinct activities 
are estimated in each location using the aggregation radius. The main types of POI in 
Trento city include building, land-use, amenity, and highway types (see Figure 10(b) i 
and the main human activities extracted from these POIs are mostly related to eating, 
residential, and work activities (see Figure [T0(a)| l. 

Our web application is able to recommend a set of top-k human activities in a given 
location and time of the user selection. However, multiple activities that have the same 
priority can occur at the same time. In this experiment, we do not evaluate the order 
of activities in the prediction due to the limitation of fine-grained ground-truth data we 
use. So we concentrate only on the “fuzzification” of the activity prediction, which 
evaluates if one of the activities among top-k prediction performed by the application 
is correct or not. However, we can further evaluate the order of the activity prediction 
if the ground-truth data we use is fine grained. So, a user evaluation among the first 
k predictions can be an initial evaluation to measure the accuracy of the model perfor¬ 
mance. In any of the above locations, upon selection, the user is prompted with a list of 
top-k human activities associated with the time of day, having the highest likelihood. 
The user is then asked to select those activities that he or she has actually performed by 
in that area at that time. The user is also allowed to select an activity not in the top-k 
list among all the available activities, in case none of the above fits. 
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Figure 9: POI and the relevant distinct human activity distribution using aggregation 
radius 



(a) Distribution of POI categories across loca¬ 
tions of the POI distribution grid, and of lo¬ 
cations of the POI distribution grid containing 
POI categories, aggregation radius=0 


0.45 

Locations containing activity categories 
Activity categories across locations 



(b) Distribution of activity categories across lo¬ 
cations of the POI distribution grid, and of lo¬ 
cations of the POI distribution grid containing 
activity categories, aggregation radius=0 


Figure 10; Semantics of POIs and activity distribution over Trento, Italy 


We collected user feedback through the web application described above for one 
week with 32 participants involved. We then analyzed the collected user feedback 
in order to measure the accuracy of the HRBModel. We measured the accuracy of the 
model. For the latter, we took into consideration only the areas with the highest amount 
of feedback. Downtown and Povo. 

To measure the accuracy of the approach, we consider the prediction of an activity 
correct if the user selected it in the web application regardless of its ordering posi- 
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tion. Within the total of 481 items of users’ feedback collected, if we consider a set 
of the top-8 activities in the recommendation, the overall correctly predicted activity 
percentage is 68% (A:=8). There, we propagated the probability activities to their child 
activities. We increased the prediction granularity of the activity to k=n, where n is 
the total number of distinct activities, and the overall prediction percentage was raised 
to 93%. By changing the granularity of activity level to high-level (parent) human ac¬ 
tivities, the prediction accuracy is increased up to 80% (k=8), and 97% (k=n). When 
we ignore the time for estimating the likelihood of human activities, the accuracy per¬ 
centage was lower 57% ik=8) but stable at 97% for k=n. This might be because some 
human activities are tightly related to the location and have nothing to do with time. 
For example, if there is an open football stadium, people tend to play there at any time 
they like. 

So the overall prediction percentage is affected by how we choose the threshold top- 
k of the prediction granularity. This might be a reason that the prediction percentage as 
well as its granularity can vary at the different local scales. However, the user feedback 
did not have enough distribution over the city. So we selected two geographical areas; 
the center of the city and Povo. In these locations, we have sufficient feedback to eval¬ 
uate the model at the different local scales. The center of the city is characterized with 
different types of activity, while the Povo is mostly characterized with studying and 
research work activities as our participants are researchers. We estimated the predic¬ 
tion accuracy in the center of the city, and the result shows that the high-level (parent) 
activity prediction percentage is 98% (k=n) but 95% in Povo. This shows that there is 
an attraction in the city center, so the participants who share a similar work profile or 
similar study profile go there to perform similar activities. This result matches the find¬ 
ing of Phithakkitnukoon et al ll45l . which showed that a daily pattern of human activity 
(sequence of human activity) strongly correlates with a certain type of geographic area 
that shares a common characteristic context. 

5.2 Evaluation through credit card data revealing human econom¬ 
ical activities 

We employed user feedback for evaluating the model in the previous section, but the 
ability to reflect actual human activity is not perfect (note the aforementioned limita¬ 
tions of user feedback data we collected in Trento city). Here, we introduce another 
source of data that could serve as a direct proxy for human economical activity for a 
more sophisticated evaluation of our HRBModel. We use bank card transaction data 
collected in Barcelona, Spain which were performed at POS terminals by users. The 
type of business activity at every POS terminal, for instance, at a department store or 
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a restaurant, etc., can reveal the human activities that can be performed at such a lo¬ 
cation. With the use of this data, we evaluate the performance of our HRBModel in a 
way similar to the evaluation with user feedback in the previous section. 



(a) Density-based POI distribution grid in 
Barcelona 


(b) (Default configuration) POI and human activity distribu¬ 
tion across the locations in Barcelona, aggregation radius=0 

Figure 11 


First, we also needed to build a density-based POI distribution grid for the city of 
Barcelona, Spain and for that we collected 2.6 million POIs extracted from OSM in a 
rectangular grid with dimensions 252 x 393, where the unit size of each grid is 50 m^. 
Further, we re-partitioned the grid using the quad-tree approach to ensure that each 


location contains h G [10,20] POIs. Final partitioning is presented in Figure 11(a) 


In total, 3,853 locations were processed. Using the data coming OSM, we populated 
the POIs that were contained in, crossed over, were located on the border of, or in¬ 
tersected each location. In total, 197,289 activity relevant POIs were extracted across 
the locations, which was reduced to 53,529 after the cleaning and discarding of irrel¬ 
evant (POIs those that do not reflect any human activity). Figure 11(b) summarizes 


the POI Altering results compared to the raw POIs. In this case the POIs are 5 times 
lower compared to the number raw POIs in each location. The figure also shows the 
corresponding human activities to POIs that were generated after the raw POI cleaning 
process. The POI and human activity density distribution across the locations are rep¬ 
resented in Figure |1 l(b)[ where the density curve is skewed in the center. The figure 
also shows that across locations, the min, median, and max number of POI are 1, 47, 
and 545, and the min, median, and max number for human activity are 1, 13, and 
46. The remote areas contain a small number of POIs and human activities, up to the 
mode of the density distribution of POIs 32 and human activities 12. In dense areas. 
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The number of POIs 
in each location 

Figure 12: POI and the relevant distinct human activity distribution considering aggre¬ 
gation radius 


the number of POIs and human activities are up to 545, and 46, respectively. These 
POIs and human activities are the representative activities in each location. 

To populate a sufficient number and diversity of POIs in each location, we consider 
the nearby areas for estimating the likelihood of human activities. The nearby areas 
are the intersected locations within the aggregation radius of the centroid point at each 
location. The aggregation radius is configured differently in each location, which sat¬ 
isfies the need for the total number of POIs in such intersected locations to be above 
the threshold h (see Figure [T^ where each location has at least h=50 number of POFs 
in the intersected locations. Across locations, the minimum, median, and maximum 
number of POIs are 18, 55, and 162. The distinct activities and parent level of distinct 
activities are estimated in each location using the aggregation radius. 

The main types of POI in Barcelona city are highway, building, and route types 
(see Figure [T3(b)[ ) and the main human activities extracted from these POIs are eating, 
residential, and sporting (see Figure 13(a)| l. 

We performed the evaluation of the model to measure the accuracy of the approach 
in same way as we have done in Trento city, but here we selected random POS locations 
in Barcelona; assume that each POS generates correct human activities. Every POS 
is associated with a business category: therefore, 76 types of businesses are given 
for all the POSs, such as travel, food, hypermarkets, hotels, real estate, automation, 
bars and restaurants, personal care, sport and toys, technology, home, content, fashion, 
leisure, health, transport, etc. Since we use both POIs and POSs in the data sets in 
our evaluation, we adjust them so that they are comparable to avoid a semantic gap 
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(a) Distribution of POI categories across lo¬ 
cations of the POI distribution grid, and 
of locations of the POI distribution grid 
containing POI categories, aggregation ra- 
dius=0 


0.35 

^ Locations containing activity categories 
0.30 ^ Activity categories across locations 



(b) Distribution of activity categories across 
locations of the POI distribution grid, and of 
locations of the POI distribution grid contain¬ 
ing activity categories, aggregation radius=0 


Figure 13: Semantics of POIs and activity distribution over Barcelona 


between the data sets. For instance, when a customer buys groceries in a supermarket, 
POS interprets that a customer is doing shopping in grocery stores. The POIs and 
POSs use different activity type taxonomies, so they need to be mapped to a common 
taxonomy for comparability. We define consistent activity types for POS using the 
human activities described in ActOnto. We discard POIs and POSs that have no activity 
type reflected in it, such as ATM use. 

For evaluating the HRBModel, we compared the human activities of a given POS 
with the set of top-k human activities generated by HRBModel in the same context 
of POS location and time, using the aggregation radius of nearby areas for estimating 
the likelihood of human activities, as we described in Section We performed this 
validation by selecting 1,000 random POSs in the city. The overall accuracy of parent 
level activity prediction percentage is 82% (A:=8) and 92% {k=n). When we ignore the 
time, the parent level activity prediction percentage is stable at 92% {k=n). This also 
shows that some human activities are tightly related to the location and have nothing to 
do with time. However, the random selection of POS is not a sophisticated choice for 
analyzing the granularity of the prediction task in the city scale. We need to estimate 
the accuracy fraction at the local scales to know where the HRBModel gives better 
accuracy and good precision. 
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5.3 Extended evaluation of the model 


In this section, we consider an extensive validation of the HRBModel in order to esti¬ 
mate if POI data can be a good proxy for predicting such a specialized type of activity 
(economical activity) at the local scale. For that purpose, we compare the possible 
activity distribution predicted by our model for each location across the city with the 
actual data on the observed spending activity in POS at that location. This gives us 
ground-truth information on for instance, what types of economic activity are occur¬ 
ring in a city. However, spatial data (i.e., POI and POS data) distribution (see Figure 
[T4 | i is very uncertain and heterogeneous, and extracting meaningful information from 
it is a very challenging task. The data can be imprecise and not available in some areas. 
The heterogeneity of the data directly influences the confidence regarding how the data 
is characterized. We performed city-wide normalization of POI activity distribution to 
match that of POS, and then matched the different categories of human activities in 
the POI and POS data, which gives good accuracy and good precision both at the local 
scale and for different land-use types. 

In our analysis, POS data reflects spending activity on a fine-grained scale, eco¬ 
nomic activities, which are very important component of spending activity. The POI 
data, on the other hand, is coarse grained, and each of its points represent a single 
coarse-grained geographical object, not the objects within it. For example, POS ma¬ 
chines are associated with every store at the mall, and the type of POS represents a type 
of activity performed at the store. But in POI data, the mall could be represented as 
a single building of type “mall”, which does not contain any information about stores 
within the mall. This granularity difference causes a quantitative difference in the cat¬ 
egories of activities in between the POI and POS data. These differences could intro¬ 
duce a substantial bias in the model performance. To achieve comparable performance 
between the data sets, which would enable using POI data for predicting economic ac¬ 
tivity, we first need to remove the overall systematic bias of the data sets. We introduce 
a global normalization to make POIs and POSs consistent at the macro-level in order 
to account for possible heterogeneity in the representativity across different categories 
of activities in the two data sets. 

We generate the distribution of different human activities in POI and POS data as 
depicted in Figure [T^ where non-economic activities including educational, outdoor, 
residential, sporting, and working types of activity, have less of a value in the both dis¬ 
tributions of different activities in POI and POS data. An economic activity is a special 
but very important component of spending and here we hypothesize that POIs can be 
used to predict the economic activity after appropriate normalization (i.e., WpoiNorm) 
for the general data set representativity bias on the city scale. The result might also be 
associated with the fact that spending is just one of the components of human activity 
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Figure 14: POI and POS distribution in Barcelona 
0.45 



<D 


Figure 15: Overall economical activity distribution in each data set, 1) POI 2) POS 


and not all activities have something to do with spending. 

For each location / we have certain distribution of POIs among n categories of 
activity we consider with relative weights, Wpoi{a,l) and estimate the likelihood of 
human activities extracted from POIs starting with the centroid of location, considering 
the POIs in all locations within 100 m of the point, as we described in Section]^ Here 
we use a larger radius size in case there are not enough POIs within a considered circle: 
we expanded the aggregation radius by 25 m if fewer than 50 POPs and POSs were 
contained in intersecting locations, or until the radius reaches to 3.000 m. This also 
gives a sufficient number of POIs in remote areas. The aggregation radius selected in 
each location is presented in Figure [T6(a)| The length of each location size is processed 
in cumulative probability after the aggregation radius selection, as plotted in Figure 


16(b) So, 80% of total locations are increased to the length of 2,000 m. Figure 17(a) 
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shows the number of POIs and POSs at each location using the cumulative probability 
function. Across locations, the minimum, median, and maximum number of POl are 
18, 93, and 958, and are 3, 55, and 952, respectively, for POS. 

Now before comparing the values of Wpoi with Wpos, we look at the overall distri¬ 
butions first, (i.e., Wpoi{a^ c) vs Wpos {a, c) where c is the entire city). It might happen 
that those average relative weights are also not the same if there exists a systematic 
bias introduced by the way POS and POI are defined (for example, it could happen that 
each restaurant appears in POIs but only 50% of restaurants appear in POS (say, not 
all of them accept cards) and this situation could change from one category to another. 
The overall POI and POS distribution across the city is depicted in Figure where 
the global distributions are relatively similar for the type of activity labeled, eating. So, 
if this systematic bias exists, then it is no surprise that we will also see a difference for 
any particular location, which would reduce our reported accuracy. 

In order to account for this bias we consider the following normalization: 
WpoiNormia, 1) = and for each location we compare Wposia, 1) vs 

WpoiNormia, 0 (which is already normalized by the possible systematic bias between 
POS and POI distributions) instead of Wpoi{a,l). In each location, the bias between 
Wpos{a, 1) vs WpoiNorm{a, 0 compared by the Hellinger distance llJTll57ll58l mea¬ 

sure, which quantifies the similarity between two probability distributions and lies be¬ 
tween 0 and 1. So we call it as an estimate error between the two distribution : 

H D (yVpoiN orm ^ W^pos) — ( y/PFrir)?: ATormft, V) 'y/PPTjn.s (t, 0)^, where 1 

is an activity of n category in each location. 


1.000 



The number of POI 
in each location 


(a) Variation of aggregation radius in each location 



0 2000 4000 6000 

length of each location size 

(b) Valuation of the location length after the ag¬ 
gregation 


Figure 16 


The estimate eiTor between the distribution of activities in POI and POS data in a 
given location is called local estimate error, and the estimate error between the distri- 
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in each location 

(a) POI and POS distribution across the location 
in the aggregation radius 



0.2 0.4 0.6 


estimate error 

(b) The POI and POS normalized estimate error vs 
local estimate en'or 


Figure 17 


bution of activities in POInorm and POS in a given location is called POI normalized 
estimate error. The estimate error between the distribution of activities in POI and 
POSnorm in a given location is called POS normalized estimate error. The density 
distribution of local estimate errors, and the POI and POS normalized estimate errors 
are described by logistic distribution fit across all the locations within the city as de¬ 
picted in Figure [T7(b)| The peak (mode) of the POI normalized estimate error gives the 
smallest error values compared to the local and POS normalized estimate error values. 
The peak (mode) of the POI normalized estimate error distribution density falls at 0.18, 
which represents an 82% match in accuracy between the POI and POS data. Either way, 
POS normalized estimate error would give the same distribution, but the mode of the 
distribution is lower. On the other hand, the peak (mode) of the local estimate error 
distribution density falls at 0.40, which is around 60% accuracy on the POI and POS 
match. So if we make two distributions comparable by de-biasing them globally, the 
global estimate error is approximately reduced by 22%, which is the quantitative differ¬ 
ence between the POI and POS data sets. For the general data set representativity bias 
on the city scale (which might also be associated with the fact that spending is just one 
of the components of human activity and that not all activities involve spending). From 
the results of previous section, we note that POI data can be used as a proxy to predict 
human activities after appropriate normalization. We will show that once overall city 
normalization is performed, two data sets start to give comparable estimates for human 
activity at the local scale: this means that one needs to know only the overall bias on the 
city scale and then POI data can be applied to predict actual human activities. As POS 
reveals actual human economic activity, we propose to use POI to estimate economic 
activity when POS data is not available. 
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(a) The number of POIs in each random location (b) The number of POSs in each random location 
of different land-use types of different land-use types 

Figure 18; The number of POI and POS in each random location selected in different 
land-use types 




(a) Variation of aggregation radius selected of random (b) Variation of an overall estimated en'or of 
locations in different land use types economical activities prediction with respect to 

POI normalized and actual POS activities 


Figure 19 


As we have seen above, the estimate error distribution is of Gaussian type and on 
the left tail, the prediction percentage is better (around 84-95%), while the right tail 
is heavy tailed down to 20%. This could be an effect of land-use types, where the 
distribution of activities in POI and POS data is well described in some type of land 
areas and not in others. For instance, while POS data about economic-type activity, it is 
supposed to have strong match in railway, commercial and dense areas. Therefore, we 
investigate the POI normalized estimate error distribution in different land-use types, 
which are referenced by OSM: industrial, recreational, commercial, railway, retail, 
residential, and the central part of the city. We randomly selected up to 100 locations 
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(a) Differences in predicted levels of different 
types of activity across various locations vs ob¬ 
served activity levels in dense (the center of the 
city) land-use based on POS data 



(b) Differences in predicted levels of different 
types of activity across various locations vs ob¬ 
served activity levels in railway land-use based 
on POS data 


Figure 20 


in each land-use type, which were chosen in order to be consistent with the previous 
work in mi. The populated sufficient POIs and POSs in each location (see Figures 


18(a) 18(b) I using the aggregation radius (see Figure 19(a) i. We excluded the outlier 


locations whose aggregation radius was more than 1,000 m, which removed 9% of the 
total locations randomly selected. 

The POI normalized estimate error distribution in those land-use types is shown in 
Figure 19(b) where the railway, commercial, and dense areas show lower error values 
with the accuracy of (81-90%): this is to be expected, since we validated the POI data 
with the POS data, which is mostly related to shops. In the railway area, we got the 
lowest error values, with an accuracy of 88-90%. In a dense area (the center of the 
city), the POI normalized estimate error with the accuracy of (81-84%), and the error 
distribution is significantly heavy tailed as the area consists of a wide range of human 
activities. The commercial land-use is also well described with the accuracy of 86%, 
while the industrial and recreational ground types of land-use have lower accuracy than 
72%. In order to estimate the sensitive error values among the different categories of 
activities at those local distributions of POI and POS data, for instance, in dense and 
railway areas, we estimated the POI normalized percentage difference, which is an er¬ 
ror between WpoiNorm{a, 1) vs lFpos(a, 1) among the different categories of activities. 
The percentage difference is a difference between two values divided by the average 
of the two values, then PD{WpoiNorm{a, /), Wpos{a, 1)) = 




p{a.l) + Wpo,{a,l) ■ 

Figures 20(a) and 20(b)| show the POI normalized percentage difference among the 
different categories of activities in dense and railway areas. In dense areas, among the 
categories, the shopping activity has the lowest error (around 11-17%) in the distribu- 
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tion of activities between the POI normalized and POS as shops are mostly located in 
the center of the city. Also in railway areas, the shopping activity has the lowest error 
(around 3-7%) in the distribution of activities as we evaluated economic-type activities 
from POI and POS. The health and medical activity has an error value of almost 100%, 
as actions of this type are not usually described in POS data. However, while the analy¬ 
sis suffers from a lack of POIs and POSs information collected the model applicability 
error can be more meaningful for estimating the data loss. The representativity of the 
data sets by the global normalization at the city scale allows us to estimate how well 
POI data can predict human activity at the global scale, and the result was validated in 
land-use types (e.g., railway and dense land-uses). 

6 Possible developments (extensions) 

We presented in previous sections that the HRBModel was developed in order to infer 
human activities by extracting concepts and their relations from open geographical 
data. In this section, we briefly explain how the correlation between context and human 
behavior encoded in HRBModel, can be exploited to infer human behavior from CDR 
data. To this purpose, we enrich CDR with human activities, which are derived from 
our model through location and time. 



(a) regular coverage (b) irregular coverage 


Figure 21: Example of the two extreme cases of cell coverage area 

As we mentioned earlier, our main contribution of this article is to use background 
knowledge regarding human activities that could happen in the context where mobile 
network events occur iniiii. The most common mobile network antennas are GSM 
900 and GSM 1800. Each antenna covers a certain area for the support of mobile com¬ 
munication. Sometimes, cell coverage areas are not precisely defined, and they can 
be temporarily modified depending on the estimation of call traffic from/to an area. 
Usually, the size of the coverage areas is inversely proportional to the density of the 
population inhabiting the area. However, there is an upper bound of the coverage area 
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Figure 22: Visualization of the ranked activities in the cell coverage for Gardolo area 
of Trento 


size due to antennas physical limits to 35 km. Concerning the shape of the cell coverage 
areas one can observe that, when the territory is regular (i.e., flat with no mountains or 
other natural irregularities), the shape of cells can be approximated with convex poly¬ 
gons, but in presence of irregular territory with a lot of mountains and valleys, the 
region associated to a cell can be very irregular and possibly disconnected. Irregular 
coverage areas makes particularly difficult the task of estimating the calling point. In 
other cases, the shape of cells can be very irregular and possibly disconnected. An 
example of these two extreme cases are shown in Figure Irregular coverage areas 
make the task of estimating the calling location particularly difficult. Regular coverage 
is a contiguous area, and irregular coverage forms a less distinct area with many holes 
and irregularities. The HRBModel can provide qualitative description of human activ¬ 
ities associated with a likelihood measure, to each cell coverage area. For example, a 
cell coverage area of Gardolo in Trento city is characterized with the possible human 
activities ranked that could happen in that area, associated with its likelihood, which 
is illustrated in Figure]^ In this case, travel by transport, outdoor activity, and eating 
are higher ranked activities. 

Contextually enriched CDR can be exploited to analyze and evaluate call pattern 
(i.e., human mobility, communication and interaction patterns) associated to human ac¬ 
tivities for revealing existing relationships between human behaviors, and CDR stream. 
This can be done through a combination of the HRBModel and the bottom-up ap- 

* * A call frequency pattern is a quantitative model that describes the behavior of a communication in a 
certain class of contexts 


38 




















proaches in pattern extraction tasks used in the state of the art. The qualitative and 
quantitative properties in semantically enriched CDR allow us to generate training data 
with a certain accuracy that, together with other behavioral data coming from bank 
card transaction, social networks and so on, can be used as ground-truth information 
for training those supervised learning approaches for automatic activity inference in 
the area of mobile phone data analysis. 

Like many factors that influence human behavior, the CDR related to such behavior 
varies depending on the different situations. For example, these situations could involve 
a behavioral pattern of attending a football match on rainy day or sunny day, or doing 
so in the central area or in remote area. Another example of the contextual factors 
that influences human behaviors is a social event. For instance, human behavior is 
different during national or local holidays. While some people used to work during 
national holidays, others may have vacation plans to go somewhere. This concept has 
been successfully applied in ll42l where we created an event repository that collected 
data from the Ivory Coast city in order to analyze CDRs for the identification and 
characterization of human behavior patterns in the city. Event is an another key element 
that can be used to enrich the background knowledge of mobile phone data records. 

Therefore, the HRBModel is very useful for filling the missing data in spatial data- 
sources, where such information is unavailable and measuring the noise, scalability and 
accuracy among the spatial data-sources. For example, if economical type of activities 
are not sufficient enough when generated in bank card transaction data, we could com¬ 
bine the OSM data with bank card transaction data in order to better study economical 
or social and emerging types of activity MMMM at the macro level. The 
most directly applicable scenario includes, for instance, at the finest scale, a seman¬ 
tic enrichment of cell coverage areas, trajectories, and behavioral patterns (normal or 
exceptional), and at the broad scale, semantic interpretation of CDR traffic data more 
toward smart (intelligent) city research that gain better understanding of real-life cases, 
resulting in better classification of human behaviors in a region or territory. 

The HRBModel can be enabled for a further consideration of the GeoSparql exten- 
sion^in order to represent and querying of geospatial linked data, such as geo-objects 
and geo-actions for the Semantic Web. The geospatial RDF/OWL data which can sup¬ 
port both quantitative spatial reasoning on the Semantic Web and querying with the 
SPARQL database query language. Further, we enrich the HRBModel with other con¬ 
textual information like weather information, events and some other types of linked 
objects and data. This possibility to extend the model further helps us to refine human 
behavior recognition by considering more complex and complete contextual informa¬ 
tion (via linked open data) by employing real-time inference techniques some of which 

*^http://geosparql.org/ 
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could be learned and decoupled with the pre-processing of features from the raw data 
of mobile phones. 

7 Conclusion 

Even without taking a person’s environment into account, many factors influence the 
collective human behavior: the geographical information system is certainly only one 
of these factors. We proposed a model able to enrich mobile phone data records with 
the context of human activity. The qualitative terms of the human activities can be used 
for labeling the data in order to train a statistical model for automatic human behavior 
recognition tasks given heterogeneous and uncertain CDR data. By leveraging data 
from Web 2.0 as a source of contextual information for mobile data records, we study 
the possibility to infer human activities with a certain accuracy. Our model is a combi¬ 
nation of ontological and stochastic model, HRBModel, for predicting a set of human 
activities associated with a likelihood measure in a given context of mobile phone data 
records. Our study demonstrated that semantic expressiveness of human-environmental 
relationships could be investigated, in principal, based on diverse geographical infor¬ 
mation systems. The evaluation of the model was performed in two different cities and 
it demonstrates the applicability and scalability of the model under the conditions of 
heterogeneity and uncertainty of those spatial data sources like POI and POS data. Fur¬ 
thermore, this model takes various types of spatial and temporal factors into account. 
Based on our evaluations performed in two different cities, the prediction level depends 
to some extent on the granularity of the activity being predicted. The analysis shows 
that the level of activity prediction from the POI data is significant (around 84-95%). 
As we concentrated on the model evaluation for a specific economic type of activities 
using point of sale bank card spending data, we validated that within areas of differ¬ 
ent land-use types, the POI is a good proxy for predicting those economic activities 
with 81-90% accuracy in dense, railway, and commercial areas. This way the context 
of mobile network records can be used to deepen the context aware human activity 
recognition from the mobile network events. Hence, a number of challenges remain 
for future research and deeper investigations of human and environmental dynamics, 
such as behavior recognition and classification from the semantically enriched CDR. 
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